---
title: "Running Evaluations with SDK"
---

<Note>
  This guide is also available as a [Jupyter
  Notebook](https://github.com/Agenta-AI/agenta/blob/main/cookbook/evaluations_with_sdk.ipynb).
</Note>

## Introduction

In this guide, we'll demonstrate how to interact programmatically with evaluations in the Agenta platform using the SDK (or the raw API). This will include:

- Creating a test set
- Configuring an evaluator
- Running an evaluation
- Retrieving the results of evaluations

This assumes that you have already created an LLM application and variants in Agenta.

## Architectural Overview

Evaluations are executed on the Agenta backend. Specifically, Agenta invokes the LLM application for each row in the test set and processes the output using the designated evaluator. Operations are managed through Celery tasks. The interactions with the LLM application are asynchronous, batched, and include retry mechanisms. The batching configuration can be adjusted to avoid exceeding rate limits imposed by the LLM provider.

## Setup

### Installation

Ensure that the Agenta SDK is installed and up-to-date in your development environment:

```bash
pip install -U agenta
```

### Configuration

After setting up your environment, you need to configure the SDK:

```python
from agenta.client.backend.client import AgentaApi

# Set up your application ID and API key
app_id = "your_app_id"
api_key = "your_api_key"
host = "https://cloud.agenta.ai"

# Initialize the client
client = AgentaApi(base_url=host + "/api", api_key=api_key)
```

## Create a Test Set

You can create and update test sets as follows:

```python
from agenta.client.backend.types.new_testset import NewTestset

# Example data for the test set
csvdata = [
    {"country": "France", "capital": "Paris"},
    {"country": "Germany", "capital": "Berlin"}
]

# Create a new test set
response = client.testsets.create_testset(app_id=app_id, request=NewTestset(name="Test Set", csvdata=csvdata))
test_set_id = response.id
```

## Create Evaluators

Set up evaluators that will assess the performance based on specific criteria:

```python
# Create an exact match evaluator
response = client.evaluators.create_new_evaluator_config(
    app_id=app_id, name="Capital Evaluator", evaluator_key="auto_exact_match",
    settings_values={"correct_answer_key": "capital"}
)
exact_match_eval_id = response.id
```

## Run an Evaluation

Execute an evaluation using the previously defined test set and evaluators:

```python
from agenta.client.backend.types.llm_run_rate_limit import LlmRunRateLimit

response = client.evaluations.create_evaluation(
    app_id=app_id, variant_ids=["your_variant_id"], testset_id=test_set_id,
    evaluators_configs=[exact_match_eval_id],
    rate_limit=LlmRunRateLimit(batch_size=10, max_retries=3, retry_delay=2, delay_between_batches=5)
)
```

## Retrieve Results

After running the evaluation, fetch the results to see how well the model performed against the test set:

```python
results = client.evaluations.fetch_evaluation_results("your_evaluation_id")
print(results)
```

## Conclusion

This guide covers the basic steps for using the SDK to manage evaluations within Agenta.
